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This  paper  has  been  submitted  to  Workshop  10,  Computer  Arithmetic,  Euro- 
Par’96,  August  27-29,  1996,  Lyon,  France.  From  the  Workshop  10  Announcement: 

This  workshop  aims  at  exploring  the  aspects  of  computer  arith- 
metic related  to  the  design  of  globally  parallel  architectures.  The 
arceis  in  which  contributions  are  sought  include  number  systems, 
error  analysis,  floating-point  and  level-index  arithmetic,  high-per- 
formance architectures  (adders,  multipliers,  dividers,  etc.),  appli- 
cation-specific architectures,  hardware  and  software  arithmetic  al- 
gorithms, fault-tolerant  arithmetic,  GCD  and  other  operations  on 
integers,  error  detection  and  correction,  and  physical  design  of  pro- 
cessing units  (FPGA  to  full  custom  implementation,  optimal  latch 
insertion,  wave  pipelining,  delay  optimization,  etc.). 

The  program  committee  consists  of  Jean-Marc  Delosme,  Yale 
University  (Chair);  Luigi  Dadda,  Politecnico  di  Milano,  Italy  (Vice- 
Chair);  Peter  Kornerup,  Odense  University,  Denmark  (Vice-Chair); 
Jean-Michel  Muller,  Ecole  Normale  Superieure  de  Lyon,  France 
(Local  Chair). 

All  accepted  papers  will  appear  in  the  proceedings  of  Euro- 
Par’96,  to  be  published  by  Springer- Verlag  in  the  Lecture  Notes  in 
Computer  Science  series. 

The  paper  is  subject  to  revision  for  compliance  with  the  recommendations  and 
requirements  of  referees  and  editors. 


■ " ' Vj 


■!,'  ‘’T^  "^--..P  w- 

.■V' 


:^,'  *A 


' '’'k'."?>iVJ 


iXifi 


(«!■ 


SI 


-0103  ,3tjtom4iiaA  .laiwimoO  .01  <^«(kil«Vf'''Q^- a»wJ 

.■liwmaiii/onaA  01 


' '•' 

-a<fq»  (« 

no  eaoiiinwnj<^  tadio  i»a»  <200  ,;;jiifii^iiU!t*  ^isto3t»|ie«-liujiC  '^•  ^il 

-oKjr^'o  c(ak'%t»b  IfWs^ilQ  bm 

^34»i , te'fitK^  ii/':«jtfitiq3T54^^  i ■ as^<^»  stfiHi? 

’’■''  .(..>>3::,=cdisAatit«ltq«'':t-&  

»I#Y  .am^sc'sQ  <Wf«tx:ixo5  9ijttl<n«3N&o 

y I ^^wKnai^'T  »j* v.  r*^  ' -"TT “' * r:\'*’^*^'^a 

•owa  lo  aRnilisasctq  ara  ai  i^qqA  lliss  itj*q«q  liA 


ni  ejljbl^  ffi  iMitei^uq  iwl  ^ ,60*1*^.^ 


biw  efioiJabflsw^pwT  udi  dfiyr  Srosj^jiWfMi, ; 


'■V( 


4-^  fV , 


i 


iw^- 


r k.X  tmJiT  'l 


.-■  1 


u.  ji.  .!■■ 


S\ 


ik 


■ 3"'.. 


i 

■:  ■ ■ '*"••■  ,1,^ 
•iSkii^t 

''■4l^^;jji 


t'r  P ’ 

"a  *,  lii  r 


[i£'  i:''^!El 


!5''  \i' 


>^v-1 


BASIC  LINEAR  ALGEBRA  OPERATIONS  IN  SLI  ARITHMETIC 


MICHAEL  A.  ANUTA,  DANIEL  W.  LOZIER, 
NICOLAS  SCHABANEL,  AND  PETER  R.  TURNER 


Abstract.  Symmetric  level-index  arithmetic  was  introduced  to  overcome  rec- 
ognized limitations  of  floating-point  systems,  most  notably  overflow  and  un- 
derflow. The  original  recursive  algorithms  for  arithmetic  operations  could  be 
parallelized  to  some  extent,  particularly  when  applied  to  extended  sums  or 
products,  and  a SIMD  software  implementation  of  some  of  these  algorithms  is 
described.  The  main  purpose  of  this  paper  is  to  present  parallel  SLI  algorithms 
for  arithmetic  and  basic  linear  algebra  operations. 


1.  Introduction 

This  paper  reports  on  a continuing  project  to  develop,  implement  and  apply 
parallel  algorithms  for  SLI  (symmetric  level-index)  arithmetic.  The  algorithms  are 
being  developed  with  a view  toward  a possible  future  implementation  in  hardware 
but  at  this  stage  they  are  being  coded  for  a particular  SIMD  (single  instruction, 
multiple  data)  computer  system,  a DEC  MasPar  MP-1^.  The  algorithms  cover 
individual  arithmetic  operations  and  extensions  to  the  BLAs  (basic  linear  algebra 
operations)  such  as  the  product  of  a scalar  times  a vector,  the  scalar  product  of 
two  vectors,  and  the  ‘saxpy’  operation  [7]  consisting  of  a scalar  times  a vector  plus 
cinother  vector.  All  these  operations,  especially  the  BLAs,  can  benefit  from  the 
parallel  execution  of  appropriate  algorithms. 

A parallel  implementation  of  individual  SLI  arithmetic  operations  for  the  MP-1 
was  developed  in  1995  by  Schabanel  under  the  direction  of  Turner  [11].  This  is  part 
of  an  envisioned  ‘computer  arithmetic  laboratory’  which  will  use  the  MP-1  to  im- 
plement different  kinds  of  computer  arithmetic  and  compare  them  on  representative 
numerical  problems;  see  Anuta,  Lozier  and  Turner  [l]. 


This  paper  has  been  submitted  to  Workshop  10,  Computer  Arithmetic,  Euro-Par’96,  August 
27-29,  1996,  Lyon,  Prance. 

Cray  Reseetrch  Inc.,  Suite  600,  4041  Powder  Mill  Road,  Calverton,  MD  20705,  e-mail; 
mike.anuta@cray.com,  voice:  1-301-595-2692,  fax:  1-301-595-2647. 

Applied  and  Computational  Mathematics  Division,  National  Institute  of  Standaurds  and  Tech- 
nology, Gaithersburg,  MD  20899,  e-mail:  dlozier@nist.gov,  voice:  1-301-975-2706,  fax:  1-301-990- 
4127. 

Ecole  Normale  Superieure  de  Lyon,  Lyon,  Prance,  e-mail:  Nicolas.Schabanel@ens.ens-lyon.fr. 

Mathematics  Department,  United  States  Naval  Academy,  Annapolis,  MD  21402,  e-mail: 
prt@sma.usna.navy.nul,  voice:  1-410-293-6732,  fax:  1-410-293-4883. 

*^This  computer  system  is  identified  in  this  paper  to  foster  understanding.  Such  identification 
does  not  imply  recommendation  or  endorsement  by  any  of  the  institutions  represented  by  the 
authors,  nor  does  it  imply  that  the  MP-1  is  necessarily  the  best  available  for  the  purpose. 
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The  MP-1  is  located  at  the  U.  S.  Naval  Academy.  Its  processor  array  has  a 
4-bit  architecture  with  4096  processors  connected  in  a 64  x 64  square  array.  The 
communication  network  consists  of  a 4-bit  bus  connecting  the  rows  and  columns 
of  the  array  with  toroidal  wraparound  to  provide  edge  as  well  as  interior  elements 
with  exactly  four  nearest  neighbors.  Each  processor  executes  instructions  on  4- 
bit  operands  using  its  own  registers  and  memory.  In  typical  SIMD  fashion,  the 
processors  are  separated  at  any  instant  into  an  active  set  and  an  inactive  set,  and 
an  instruction  stream  issued  from  a special  control  processor  called  the  ACU  (array 
control  unit)  is  executed  simultaneously  on  all  processors  in  the  active  set. 

The  MP-1  is  programmed  using  a version  of  ANSI  C called  MPL^.  MPL  pro- 
grams are  compiled  on  a DEC  (Digital  Equipment  Corporation)  workstation  that  is 
connected  directly  to  the  ACU  and  processor  array.  Arithmetic  on  integer  operands 
consisting  of  8,  16,  32  or  64  bits,  or  FLP  (floating-point)  operands  consisting  of  32 
or  64  bits,  is  built  up  in  each  active  processor  using  4-bit  operations  according 
to  instructions  issued  by  the  ACU.  The  MPL  data  types  corresponding  to  these 
operand  types  are  char,  short,  long,  long  long,  float  and  double.  Since 
one  MPL  operation  or  4096  take  the  same  time,  the  power  of  MP-1  computing  de- 
pends on  algorithms  with  high  data  parallelism.  This  is  typical  of  SIMD  computing 
in  general. 

The  paper  is  organized  as  follows.  After  a review  of  the  SLI  representation  in 
Section  2,  we  review  the  original  arithmetic  algorithms  and  our  MasPar  implemen- 
tation in  Section  3.  This  implementation  simulates  individual  SLI  operations  with 
a modest  degree  of  parallelism  using  the  MPL  data  type  long  long.  In  Section 
4 we  present  modified  algorithms  that  are  more  parallelizable,  particularly  in  a 
SIMD  architecture.  The  original  and  modified  algorithms  can  all  be  executed  in 
finite-precision  fixed-point  arithmetic.  However,  it  should  be  noted  here  that  error 
analyses  of  the  original  algorithms  in  earlier  papers  provide  an  accurate  determina- 
tion of  the  number  of  guard  bits  needed  to  support  a specified  precision,  at  least  for 
individual  operations.  The  number  of  guard  bits  is  modest.  Corresponding  error 
analyses  of  the  modified  algorithms  will  be  the  subject  of  future  work.  In  the  final 
section  of  this  paper  the  BLAs  needed  for  Gauss  elimination  will  be  developed. 

2.  The  SLI  Computer  Arithmetic  System 

The  LI  (level-index)  representation  of  real  numbers,  its  application  to  computer 
arithmetic,  and  recursive  algorithms  for  arithmetic  operations,  were  introduced  in 
1984  and  1987  by  Clenshaw  and  Olver  [3,  4].  The  symmetric  form  of  the  LI  system 
was  developed  in  1988  by  Clenshaw  and  Turner  [6].  See  also  the  survey  [5]. 

Define  the  infinite  sequence 

(1)  = = 

If  X is  an  arbitrary  real  number,  define 

(2)  level(X)=^  Ei  <\X\  < Ei+ul=  Q, 1,2, .. .. 
ll  I = level (X),  define 

(3)  index(X)  = In^^^  \X\ 


^High  Performance  Fortran  is  present  also  but  only  MPL  is  of  interest  in  tins  paper. 
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where 

(4)  1X|  = |X1,  = 1X1,^  =0,1,2,.... 

The  generalized  logarithm,  defined  as 


(5)  V’('X’)  = level(X)  + index(X), 

maps  [0,  oo)  onto  itself  monotonically  and  so  it  is  invertible  on  this  interval.  The 
inverse,  a generalized  exponential  function,  is  defined  recursively  by 


(6) 


if  0 < X < 1, 
if  X > 1; 


compare  (1).  Properties  of  generalized  exponential  and  logarithmic  functions  are 
given  in  [2].  For  example,  <^,  V*  € C^(0,oo). 

For  computer  arithmetic  with  w bits  per  word,  LI  representations  are  stored  in 
two  fields:  (sign  bit,  generalized  logarithm).  That  is,  for  an  arbitrary  nonzero  X, 


(7)  X = sx<f>ix),  x = V’(X), 

and  X is  stored  in  ordinary  {w  — l)-bit  fixed-point  format.  The  sequence  (1)  grows 
very  quickly.  The  first  few  terms  are 


Eq  = 0,  E\  = 1,  E2  = e ~ 2.72,  =:  e*  ~ 15.2, 

Ei  = e"'  « 3,  810,  000  « 10®'®®  « 

Es  w 20^’®®°’°°°  ~ 2®’®°®’°°°  ~ 2^^^  ^ 

^ 1^1,660,000  06, 600, 000 

Ee « 10^°  « 2^  . 

As  a consequence  of  this  growth,  Lozier  and  Olver  proved  [8]  that  the  finite  set  of 
LI  representations  with  level  not  more  than  6,  and  w (the  word  length)  not  more 
than  5.5  million  or  so,  is  closed  under  individual  arithmetic  operations  (excluding, 
of  course,  division  by  zero).  Therefore,  taking  3 bits  in  the  integer  and  w — A bits 
in  the  fractional  part  of  x always  suffices  in  practice. 

The  ri;-bit  LI  representation  has,  in  effect,  an  ‘accumulation  point’  at  infinity, 
but  not  at  zero.  The  lo-bit  SLI  representation 


(8) 

(9) 

(10) 
(11) 


(12)  X = sx<i>{xy^ , X = V'(max(X,  X"'))  = ./.(X’’^  ), 


where  X 91^  0 and  x is  stored  in  {w  — 2)-bit  fixed-point  format,  allows  for  an 
‘accumulation  point’  at  both  infinity  and  zero.  As  a consequence  of  the  closure 
property  cited  above,  the  SLI  system  is  free  of  both  overflow  and  underflow. 

Figures  1 and  2,  taken  from  [10],  compare  SLI  and  typical  FLP  representations 
for  w = 32,  64, 128;  see  also  [8].  The  FLP  overflow  limits  for  these  word  lengths  are 
10®®,  10®®®  and  10^®®^,  approximately,  eissuming  significand  lengths  of  23,  52  and 
112  bits.  The  SLI  index  lengths  are  27,  59  and  123  bits.  The  vertical  axis  in  each 
figure  measures  ‘significance’ 


- logic 


X+  -X 


X 
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Figure  1.  SLI  vs.  FLP  Figure  2.  SLI  vs.  FLP 

for  w = 32  bits.  for  w = 32,  64, 128  bits. 


where  X"*"  is  the  successor  of  X in  a particular  computer  arithmetic.  The  horizontal 
axis  measures  log^gX.  As  expected,  the  FLP  significance  is  constant  ‘on  average’ 
but  exhibits  an  oscillation  due  to  binary  normalization.  The  SLI  significance  ex- 
hibits no  such  oscillation.  The  SLI  significance  is  better  at  first  but  it  decreases 
slowly  and  crosses  over  the  FLP  significance.  The  FLP  systems  fail  beyond  their 
overflow  limits  while  the  SLI  systems  retain  useful  significance  far  beyond  the  limits 
of  the  graphs.  The  initial  superiority  of  the  SLI  significance  is  due  to  the  index 
having  a greater  number  of  bits  than  the  FLP  significand  for  a given  word  length. 

The  LI  and  SLI  arithmetic  algorithms  apply  not  only  to  individual  operations 
but,  through  appropriate  adaptations,  to  expressions  of  the  form 

Xi±X2±---±X;i,, 

These  adaptations  reduce  the  number  of  recursive  steps  in  comparison  to  building 
up  the  expressions  by  individual  arithmetic  operations.  Further,  the  computations 
can  be  arranged  to  take  advantage  of  parallel  computing  facilities  since  the  algo- 
rithms generate  N sequences  which  are  completely  independent  of  one  another. 

The  interest  in  SLI  arithmetic  stems  from  its  potential  for  simplifying  computer 
programming.  Because  of  its  ability  to  represent  extremely  large  numbers  and  their 
reciprocals  in  a small  number  of  bits,  the  vexing  overflow  and  underflow  problems  of 
FLP  systems  are  avoided  completely.  Software  engineering  experience  shows  that 
defensive  coding  artifices  which  have  been  developed  to  guard  against  overflow  and 
underflow,  such  as  the  ones  described  in  [9],  add  significantly  to  the  cost  of  creating 
and  maintaining  robust  software. 


3.  A MasPar  Implementation  of  SLI  Arithmetic 

In  this  section  we  review  the  original  recursive  algorithms  given  in  [6].  Then  we 
describe  our  recent  implementation  on  the  DEC  MasPar  MP-1. 
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3.1.  The  Original  SLI  Algorithms.  The  problem  is  to  solve  the  equation 

(13)  Z = sz<j>{zy^  = sx(i>{xy^  o SY<t>{yy^  =XoY 

for  sz,  Tz,  z given  sx,sy,  rx,  ry,  x,  y,  o.  For  simplicity  of  presentation  we  restrict^ 

(14)  Sz  = sx  = sy  = 1,  X >Y  > 0,  o G {+,  — , ■,  /}.  . 

We  confine  our  attention  here  to  the  additive  operations,  for  which  there  are  three 


cases: 

(16) 

large 

4>{zy=^ 

= ^{x)  ± <i>{y), 

(16) 

mixed 

4>{zy- 

= (t>{x)  ± (?i(y)"\ 

(17) 

small 

4>{zy^ 

In  all  cases  rz  — rx  except  possibly  in  large  subtraction,  mixed  subtraction,  or  small 
addition.  These  exceptional  cases  have  been  called  flipover  cases. 

Let  £x  = level(A’), /x  = index(X)  and  similarly  for  iy  ,£z,  fr,  fz-  The  algo- 
rithm generates  the  a-  and  c-sequences 

(18)  aj  = l/<j){x  - j),  Cj  = (f){z  - j)/<i>{x  - j), 

the  appropriate  form  of  the  6-sequence 


(19) 


4>{y  - - j)  (large), 


i/^(y-;) 


(mixed). 


^(t>{x  - i)/<i>{y  - j)  (small), 
and  in  some  situations  the  /i.-sequence 
(20)  ' * hj  = <i>{z  - j). 

The  a-  and  6-sequences  are  generated  by 


(21) 

- e , a 

I 

II 

Oi 

1 

II 

ix  - 

-1,.. 

-.1) 

and 

(22) 

large 

b,-l  = 

U 

= iY 

- 1,- 

.,1), 

(23) 

mixed 

hr-i 

fei-1  = 

u 

= iY 

-1,. 

.,1), 

(24) 

small 

^lx-1 

1 

1 

II 

^^,6,_i  = 

{j 

= ix 

- 1,. 

..,1). 

Then 

the  starting  value  for  the  c- 

sequence 

LS 

1 ± 60 

(large). 

(25) 

c = < 

1 ± aobo 

(mixed). 

^ 1 ± 60 

(small). 

® All  these  restrictions  are  removable  by  basic  properties  of  the  SLI  representation  without  any 
need  for  additional  rectrrsive  algorithms. 
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In  the  small  case,  c is  the  reciprocal  of  Cq  as  defined  in  (18).  Observe  that,  to  a 
considerable  extent,  the  o and  6-sequences  are  independent,  and  that  a recurrence 
like  (21)  can  be  used  to  compute  exp(— <^(y  — -^x))  in  (24). 

With  the  observation  that 

(26)  = c<f>{xY^  =.  cjaYY 

we  see  that  flipover  occurs  when  c < oq  in  the  large  and  mixed  cases,  and  when 
aoc  > 1 in  the  small  case.  Then  in  the  small  case,  z = 1 -I-  In  aoc  and  the  algorithm 
is  complete.  In  the  other  two  cases  we  generate  the  /i-sequence  from 

(27)  /ii  = — Inc/oo,  /ij=lnh.j_i  (j  = 2,3, ...) 

until  hj  € [0, 1),  then  we  set  z = j + hj  and  the  algorithm  is  complete. 

Now  suppose  flipover  does  not  occur.  If  = 1 then  we  generate  the  h-sequence 
and  z as  above,  but  with  the  starting  value  hi  = fx+'^x  Inc.  If  ix  > I we  generate 
the  c-sequence  from 

(28)  Cl  = 1 -h  r-xci  In  c,  Cj+i  = 1 + aj+i\ncj  (>=1,2,...) 
until  either 

(1)  Cj  < aj  and  j < ix  — f,  which  implies  z = j + Cj/aj  and  the  algorithm  is 
complete,  or 

(2)  j = ix  — 1 and  Cj  > aj,  which  requires  the  generation  of  the  /i-sequence 
and  z as  above,  but  with  the  starting  value  hi^  = fx  + lnc^_,j._i. 

A linearized  error  analysis  in  [6]  considers  the  ‘working  precisions’  needed  to  limit 
the  rounding  errors  in  the  algorithms  presented  above  to  the  size  of  the  inherent 
errors;  see  also  [4].  The  analysis  shows,  for  a word  length  w = 32  bits,  it  suffices  to 
compute  and  store  all  sequences  to  6 bits  before  and  36  bits  after  the  binary  point. 

3.2.  A MasPar  Implementation.  An  early  decision  to  be  made  in  designing 
an  arithmetic  unit,  whether  in  software  or  hardware,  is  how  the  exp  and  In  func- 
tions should  be  computed.  The  utility  of  the  CORDIC  (coordinate  rotation  digital 
computer)  approach  has  been  proven  in  handheld  calculators  [12].  Therefore  this 
technique  was  chosen  and  programmed  using  the  64-bit  integer  data  type  long 
long  even  though  use  of  64-bit  FLP  library  routines  present  in  the  MP-1  would 
have  saved  some  effort.  It  will  be  seen  that  CORDIC  algorithms  make  effective  use 
of  SIMD  parallel  architecture. 

CORDIC  algorithms  generate  a sequence  of  real  triples  {xj,  yj,  Zj)  from  a linear 
flrst-order  recurrence  relation  that  is  defined  in  terms  of  a prespecified  real  sequence 
€j.  As  one  component  converges  to  zero,  the  other  components  converge  to  specific 
function  values  as  determined  by  the  sequence  €j  and  initial  conditions  {xo,yo,  zq). 

Our  implementation  provides  the  functions  A exp(±t),  exp((t  — 1)/A)  and  Alnt 
for  special  and  restricted  ranges  of  the  arguments.  For  the  first  of  these  functions, 
A enters  only  into  the  initial  condition,  and  the  parallel  capability  of  the  MP-1  is 
not  applicable.  For  the  second  and  third,  our  CORDIC  algorithm  multiplies  every 
term  of  the  sequence  €j  by  A.  A certain  number  of  processors,  say,  is 

allocated  to  perform  this  tcisk.  This  is  called  a CORDIC  cluster. 

Since  the  a-  and  6-sequences  are  independent,  and  each  requires  a CORDIC 
evaluation,  an  individual  SLI  additive  or  multiplicative  operation  is  programmed 
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to  use  two  CORDIC  clusters.  The  remainder  of  the  processor  array  is  used  to  repli- 
cate this  structure,  so  that  SLI  arithmetic  operations  can  be  executed 

simultaneously. 

Our  MasPar  implementation,  accordingly,  provides  the  C functions  sli_add, 
sli_sub,  sli_mul,  slijdiv,  sli_op  in  which  the  first  four  accept  scalar  argu- 
ments, and  the  last  accepts  vector  arguments  and  an  operation  symbol.  Compari- 
son, type  conversion,  input  and  output  operations  are  provided  also. 

The  effect  of  varying  the  number  of  processors  per  CORDIC  cluster  was  studied 
in  [11].  For  single  operations,  allowing  Ncokdic  to  be  nonintegral,  the  minimum 
execution  time  occurred  with  20  processors.  For  vector  operations,  assuming  the 
vector  length  fits  the  array  exactly,  the  minimum  time  occurred  at  the  maximum 
vector  length,  which  corresponds  to  only  one  processor  per  CORDIC  cluster.  The 
compromise  IVcordic  = 3 was  recommended  for  the  MP-1  because  interprocessor 
communication  within  groups  of  16  processors  is  especially  efficient. 

The  MasPar  ‘computer  arithmetic  laboratory’  [1]  is  intended  to  be  somewhat 
indicative  of  tradeoffs  that  could  be  expected  in  hardware  implementations  of  novel 
computer  arithmetic  systems.  Although  communication  between  neighboring  SIMD 
processors  is  not  directly  comparable  to  communication  in  hardware,  the  results  de- 
scribed above  show  that  parallel  techniques  can  be  used  to  speed  up  SLI  arithmetic 
operations. 


4.  Modified  SLI  Arithmetic  Algorithms 

In  this  section  we  describe  modifications  of  the  previous  algorithms  which  are 
well-suited  to  SIMD  parallel  implementation.  The  modified  algorithms  can  also 
be  used  to  advantage  in  a serial  implementation.  The  modified  algorithms  for  SLI 
addition  and  subtraction  were  first  presented  in  [l].  They  are  reviewed  briefly  here 
for  completeness  and  ease  of  reference  and  then  extended  to  multiplication,  division 
and  several  basic  linear  algebra  operations. 

4.1.  The  Addition  and  Subtraction  Algorithms.  We  retain  the  restrictions 
(14)  of  the  preceding  section.  Then  the  basic  problem  of  SLI  addition  or  subtraction 
is  to  find  z and  its  associated  sign  = ±1  such  that 

(29)  Z = <f>{zY^  = <f>{xy^  ± =X±Y. 

We  must  consider  the  cases  (15-17). 

The  algorithm  begins  in  the  same  way  for  all  cases.  As  before,  we  denote  the 
level  and  index  of  X,Y  by  lx,  fx]W,  fr  respectively  so  that  lx  = [x],Iy  = [y], 
fx,fY  e [0, 1)  and 

(30)  X = lx  + fx,  y = Iy  + fy  ■ 

In  addition  to  the  o-sequence  in  (18),  which  we  now  denote  as  aj(x),  we  define 

“j(y)  = l/^(y  - i)  by 


(31)  ojy_i  (y)  = e 


aj-i  (y)  = exp(-l/aj  {y))  (j  = ly  - 1, . . . , 1); 
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compare  (21).  Then  the  starting  value  for  computing  the  c-sequence  can  be  ex- 
pressed as 


(32) 


c = 


'l±  ao{x)/ao{y) 
< 1 ± ao{x)ao{y) 

1 ± ao{y)/ao{x) 


(large), 

(mixed), 

(small). 


This  definition  is  equivalent  in  all  mathematical,  but  not  numerical,  respects  to 
(25). 

Some  implementation  details  are  omitted  here.  For  example,  the  division  in 
the  large  case  of  (32)  could,  for  fixed  finite  precision  arithmetic,  take  the  form  of 
0/0.  However,  under  our  assumptions  00(1)  < 00(2/)  so  that  if  ao{x)  = 0 then 
one  of  c = 0, 1 or  2 is  appropriate.  Such  considerations  were  dealt  with,  for  the 
various  cases,  in  [4]  and  [6]  and  can  be  similarly  treated  here.  The  remainder  of 
the  addition  and  subtraction  algorithm  is  performed  as  described  in  Section  3 of 
this  paper,  with  the  understanding  that  the  o-sequence  in  (26-28)  is  CLj{x).  The 
complete  modified  algorithm  for  SLI  addition  and  subtraction  is  summarized  for 
reference  and  comparison  as  Algorithm  1. 


Algorithm  1.  Modified  SLI  Addition  and  Subtraction  of  Positive  Arguments 
Input  {rx,x),  {rY,y),  and  Sop  = -1-1  (for  addition)  or  Sop  = —1  (for  subtraction) 
Initialize  Booleans  large,  mixed,  small,  and  = rx 
Compute  o-sequence  = aj(x)  and  ao(y) 

j = 1 

If  large  then  c = 1 -f  Sop^o/ao(3/) 

if  c < ^0  then  rz  = —rx,  h = — In  c/^o,  go  to  h-step 
If  mixed  then  c = 1 -f  So-ptoOK){y) 

if  c < ^0  then  rz  = —rx,  h = — Inc/^o,  go  to  h-step 
If  small  then  c = 1 -H  Sop“o(y)/^o 

if  c^o  > 1 then  rz  = —rx,  z = 1 -1-  Inc^Oj  go  to  Output 
If  £x  = ^ then  h = fx  -I-  r-jr  la  c,  go  to  h-step 
else  c = 1 -H  rx^i  In  c 
While  j < ix  — f and  c > (j 
j = j + l,c  = 1 + Inc 
If  c < then  z = j + c/^j,  go  to  Output 
j = £x,h  = fx  -I- Inc 


h-step  While  h>  1 

j = j + 1,  h = Ink 
z = j + h 
Output  {rz,z) 


The  stopping  conditions  for  the  c-sequence  are  simpler  than  they  appear.  For 
rx  = -|-1)  for  example,  the  first  condition  governs  addition;  the  second  governs 
subtraction.  At  most  one  step  of  the  /t-sequence  is  needed  for  addition.  More  are 
needed  only  for  the  most  severe  cases  of  cancellation. 

The  overall  structure  of  Algorithm  1 is  simpler  than  the  original  SLI  algorithm 
in  that  the  special  forms  of  the  6-sequence  (22)  and  (24)  are  not  needed.  Of  course 
the  error  analysis  of  the  algorithm  is  different  but  that  is  not  the  subject  of  the 
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present  work.  The  complexity  of  the  algorithm  in  terms  of  its  use  of  special  expo- 
nential and  logarithmic  functions  is  not  significantly  different  but  there  is  a more 
natural  parallelism  in  the  computation  of  the  two  a-sequences  that  facilitates  SIMD 
computation,  and  perhaps  also  computation  in  hardware. 

Extension  of  Algorithm  1 to  extended  summation  similar  to  that  described  in  [14] 
for  the  original  algorithm  is  readily  achieved.  This  is  summarized  in  Algorithm  2 
for  the  computation  of 

N N 

z = sz4>  ^ s^<t){xiy'  = Xj 

i=0  1=0 

where  we  cissume  that  4>{xoy°  > (p{xiY'  for  all  i and  that  sq  = -hi. 

Algorithm  2.  Modified  SLI  Summation  of  N Arguments 

(si, r,, Xt)i^o  ^0  = +1  and  |Xo|  > |Xt|  for  z > 1 
Booleans  large,  small,  and  rz  = tq 
o-sequence  = 0^(20)  and  ao(xi)  for  i > 1 

j = 1 

If  large  then  c=  1 , ^z  = sgnc,  c=  |c| 
if  c < ^0  then  rz  = — t’o,  h = — Inc/^o,  go  to  h-step 
If  small  then  c =:  1 -h  X]  sz  = sgn  c,c=  |c| 

if  c^o  > 1 then  rz  — — rg,  h — Inc^Oj  go  to  h-step 
If  ixa  = 1 then  h = fxo  + la  c,  go  to  h-step 
else  c = \ + ro^i  Inc 

Complete  the  algorithm  exactly  as  in  Algorithm  1 

For  serial  computation,  Algorithm  2 represents  a saving  of  approximately  66% 
since  the  repeated  serial  application  of  Algorithm  1 requires  N such  operations 
which  typically  entail  2N  a-sequences  and  N c-sequences.  A serial  implementation 
of  Algorithm  2 needs  just  [N  -hi)  a-sequences  and  just  a single  c-sequence.  In 
a parallel  environment.  Algorithm  2 heis  essentially  the  same  complexity  as  Algo- 
rithm 1:  all  the  a-sequences  can  be  computed  simultaneously  and  the  completion 
of  the  algorithm  is  unchanged.  The  only  extra  work  is  the  fixed-point  summation 
to  obtain  c which  can  use  an  efficient  reduction  algorithm.  Even  for  terms  of  mixed 
sign  it  is  unnecessary  to  exercise  special  care  over  the  arrangement  of  this  internal 
summation  because  all  internal  arithmetic  to  the  SLI  algorithms  is  fixed-point  in 
nature. 

4.2.  Multiplication  and  Division.  Again  we  adopt  the  restrictions  (14),  using 
Y/X  = (X/Y)~^  where  necessary.  Let  o denote  either  multiplication  or  division, 
and  consider  (15-17)  with  o in  place  of  ±.  The  only  cases  of  flipover  occur  for  mixed 
multiplication  with  (f>{x)  < and  for  small  division,  but  even  these  do  not  need 
special  treatment  in  the  algorithm.  The  following  table  shows  that  it  suffices  to 
consider  only  computations  of  the  form 


Input 

Initialize 

Compute 


(33) 

where  u>  v > 1: 


4>{z)  = (^(u)  o (f>{v) 
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Original  operands  Multiplication 

large  rz  = +1,  (t>{z)  = 4>{x)  * <i>{y) 

mixed  0(x)  > ^{y)  rz  = +1,  (i>{z)  = (f>{x)/4>{y) 

mixed  ^(x)  < 4>{y)  t'z  — -1,  <i>{z)  = (f>{y) / <f>{x) 

small  rz  = -1,  <p{z)  = <f>{y)  * <f>{x) 


Division 

rz  = +1,  (f>{z)  = 4>{x)/(t>{y) 
rz  = +1,  ^(2)  = (i>{x)  * 4>{y) 
rz  = +1,  (t>{z)  = (i>{y)  * (f>{x) 
rz  = +1,  <f>{z)  = 4>{y)l(f>{x) 


In  all  four  cases  we  have  z > 1 and  either  u = x,v  = y or  u = y,v  = x.  By 
analogy  with  (18),  (25)  and  (26)  we  can  write 


(34)  c = ao{v) 

where  cq  = for  multiplication  and  cq  = for  division.  Then  Algorithm  3 
proceeds  as  the  large  case  of  Algorithm  1 with  simplifications  because  flipover  is 
impossible.  The  initialization  (34)  has  the  merit  of  being  universally  applicable.  For 
a SIMD  or  potential  hardware  design,  this  may  be  preferable  to  the  initialization 
that  was  used  in  the  original  serial  algorithm. 


Algorithm  3.  Modified  SLI  Multiplication  and  Division  of  Positive  Arguments 
Input  u,v,  and  r = —1  (for  multiplication)  or  r*  = +1  (for  division) 
Compute  o-sequence  = o,j{u)  and  c = ao('y) 
j=l 

Complete  the  algorithm  exactly  as  in  Algorithm  1, 
but  with  £x  = ^u,  fx  = fu  and  rx  = r 


5.  Parallel  SLI  Algorithms  for  Basic  Linear  Algebra  Operations 

Two  of  the  more  important  low  level  BLAs  (basic  linear  algebra  operations)  are 
the  scalar  product  and  saxpy.  The  scalar  product  is  easily  performed  by  the  si- 
multaneous SLI  elementwise  multiplication  of  the  components  using  Algorithm  3, 
followed  by  the  summation  Algorithm  2.  This  idea  generalizes  in  the  obvious  man- 
ner to  all  the  standard  vector  norms.  This  has  been  discussed  using  the  original 
SLI  arithmetic  algorithms  in  [9]  and  [13]. 

It  is  possible  to  design  an  SLI  dot-product  operation  which  does  not  complete  all 
the  elementwise  products  but  instead  uses  the  information  from  the  o-sequences  to 
obtain  those  for  the  summands  and  therefore  to  reduce  the  complete  scalar  product 
operation  to  just  one  extended  SLI  operation.  The  saxpy  operation  is  a fundamental 
operation  of  the  Gauss  elimination  algorithm;  it  is  treated  similarly  in  Section  5.2. 

5.1.  Dot  Product.  Our  objective  here  is  to  obtain 

sz4>{zy^  =X^Y 

where  the  components  of  the  iV- vectors  X,Y  are  stored  in  SLI  form: 

and  = (y,),^,  = (^yXy^rOili 

The  first  step  is  to  rearrange  the  data  so  that  each  elementwise  product  is  of  the 
form  (33). 
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Simultaneously  for  each  i we  set 

Si  = SXi  • SYi 

Pi  = rxi  ■ rYi 

/ggx  Ui  = max(ii,yi) 

' Ui  = min(xi,yi) 

r = I 

* \ I’Yi  otherwise 

Then  the  required  dot  product  given  by 

N 

(36)  sz<t>{zy^  = y^.Si  {<t>{ui)  • (f>{viY'Y' 

1=1 

where  each  of  the  internal  operations  is  in  the  desired  form.  Our  objective  is 
however  to  compute  the  sum  without  completing  these  internal  multiplications  and 
divisions  explicitly. 

Although  we  do  not  compute  these  component  products,  it  will  be  useful  to 
denote  them  by  Wi.  That  is,  for  each  i = 1,2, . . . ,N 

(37)  <f){wi)  = 4>{ui)  ■ <i>{viY^ 

For  the  extended  summation  Algorithm  2,  the  o-sequence  of  each  component  was 
required.  Strictly,  the  full  sequence  is  required  only  for  the  largest  component  of 
the  sum;  for  the  others  just  ao(u;i)  suffices.  Using  (37),  we  have 

(38)  a.o{wi)  = ao(ui)  • ao{viY' . 

To  generate  the  initial  value  for  the  c-sequence  of  the  final  summation  without 
completing  these  products,  it  only  remains  to  identify  the  largest  component  of 
the  sum.  In  order  to  complete  the  summation,  we  shall  also  need  to  obtain  the 
complete  o-sequence  for  this  term. 

The  first  of  these  tcisks  is  simply  achieved.  The  function  oo(x)  is  monotone 
decreasing,  and  the  largest  term  in  the  sum  corresponds  to  min{ao(iOi)  : Vi  = +1} 
assuming  this  set  is  nonempty  and  to  max{ao(tOi)}  otherwise.  This  extreme  value 
can  be  obtained  by  the  usual  reduction  process.  Let  tt;  denote  this  largest  term: 

's(f>{wY  = 7 [(f>{u)(p{v)p) 

and  let  Aj  = cij[w)  and  aj  = aj(u).  Also,  we  shall  denote  by  Cj  the  c-sequence  of 
the  summation  and  by  Cj  that  for  (f){u)4>{yY  ■ The  sequence  aj  is  already  known. 
Also  Aq  is  given  by  (38)  for  the  appropriate  i.  From  (34),  we  have  c = ao{v)  and 
modifying  the  definition  in  the  summation  algorithm  to  this  situation 

_ N 

(39)  C = 5 - Aq  Sj  • ao{wi)~'^' 

■i=l 

with  the  various  terms  given  by  (38). 

It  remains  only  to  obtain  the  rest  of  the  sequence  Aj . The  rest  of  the  algorithm 
can  then  be  completed  just  as  for  the  regular  summation  using  the  recurrence 


Cl  — 1 -|-  rAi  In  C, 


Cj  — 1 -p  Aj  InCj  — i 
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and  any  terms  of  the  /i-sequence  which  may  be  needed. 

By  definition,  Aj  = j)  Cj  = (p{w  — j)/<f>{u  — j).  We 

already  have,  as  usual,  Cj  = 1 + aj  lncj_i.  Multiplying  the  first  two  of  these,  we 
get  AjCj  = \/4>{u  — j)  = a. j from  which  we  deduce  that 


^ 1 4-  aj  In  Cj  _ 1 

which  can  be  computed  in  parallel  with  Cj.  It  follows  therefore  that  the  required 
sequences  can  be  computed  in  (staggered)  parallel  so  that  (cj,  Aj,Cj-i)  are  all 
obtained  simultaneously  which  effectively  adds  just  one  step  to  the  c-sequence. 

The  overall  effect  of  this  algorithm  for  the  SLI  dot  product  is  that  the  complete 
operation  becomes  equivalent  to  just  an  extended  summation  (Algorithm  2)  which 
we  have  already  seen  has  essentially  the  same  complexity  as  a single  SLI  operation. 
Of  course  we  would  anticipate  that  it  is  necessary  to  use  a greater  fixed-point 
wordlength  for  the  reduction  summation  in  (39)  due  to  the  fact  that  each  summand 
is  obtained  from  its  factors  (38). 

5.2.  Saxpy.  In  this  section  we  turn  to  the  vector  operation 

Z = aX  + Y 

where  X,  Y,  Z are  W- vectors  and  a is  a scalar  with  all  components  given  in  their 
SLI  representations.  This  of  course  includes,  as  an  important  special  case,  multi- 
plying or  dividing  a vector  by  a constant.  The  parallelism  of  the  operations  for 
the  individual  components  of  Z is  apparent  and  so  we  concentrate  on  the  single 
multiply-accumulate  operation 

(40)  sz(f>{zY^  = Sa(f>{aY‘‘  ■sx<P{xY^  +sy<^(2/)’’'' 

Although  we  can  concentrate  on  just  one  such  operation,  it  is  necessary  to  observe 
that  we  cannot  insist  on  any  particular  (magnitude)  ordering  among  the  operands 
or  the  partial  results  since  all  possible  combinations  may  be  encountered  within  a 
single  SLI  saxpy  operation. 

The  signs  Sa  and  Sx  can  be  immediately  combined  to  yield  the  sign  s-^j  of  the 
product  term  which  we  denote  temporarily  by 

swHwY'^  = {sccsx)  ■ (t>{aY"‘  ■ <l>{xY^ 

The  two  factors  of  the  product  can  also  be  arranged  as  for  SLI  multiplication  so 
that  (cf.  (33)) 

(/)(w)  = (f){u)  o 4>{v) 

with  It  > ^;  > 1: 

Original  operands  x>a^u=x,  v = a x<a=>u  = a,  v = x 

large,  rx  = Xa  = -|-1  rw  = +1,  o = * rw  = +1,  o = * 

mixed,  rx  = +1, Ta  = -1  rw  = +1,  o = / rw  = -l,  o = / 

mixed,  rx  = —1, 7’a  = +1  rw  = —1,  o = / rw  = +1,  o = / 

small,  rx  = ra  = —1  rw  = —1,  o = * rw  = —1,  o = * 
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Now,  in  a similar  manner  to  that  used  for  the  dot  product,  we  have 


(41)  ao(w)  = ao(u)  ■ ao(vy 

where,  as  in  Algorithm  3,  r = ^1  for  multiplication  and  division  respectively.  In 
order  to  define  c for  this  combined  operation  we  must  determine  which  is  the  larger 
operand  for  the  addition.  Again  the  decreasing  nature  of  the  function  ao(»)  is  used 


help  decide  this  as  follows: 

Operands 

sz 

c definition 

large 

aQ{w)  < ao(y) 

sw 

aj{w) 

c = 1 + ao{w)/ao[y) 

Vw  = Ty 

= -hi  aoiw)  > ao{y) 

Sy 

“i(2/) 

c = 1 -h  ao{y)laQ{w) 

mixed 

rw  = +1)  = —1 

aj{w) 

0=1-1-  ao(tu)  • ao(2/) 

T'w  = — 1)  T’y  = +1 

Sy 

o-i{y) 

c = 1 -t-  ao(u;)  • ao(3/) 

small 

ao{w)  < ao{y) 

sy 

“j(y) 

c = 1 + ao(u;)/ao(y) 

— Ty 

= -1  oo(iu)  > ao{y) 

Sw 

aj{w) 

c = 1 -h  ao{y)/ao{w) 

In  the  cases  where  = aj(y)  the  algorithm  can  now  be  completed  exactly  as 
in  the  standard  SLI  addition  Algorithm  1.  For  the  other  cases,  the  quantities 
= aj(w)  are  not  readily  available  and  must  be  computed  from  the  sequences 
aj(u),aj(v).  It  is  on  these  ceises  that  we  concentrate.  The  details  here  are  similar 
to  those  of  the  dot  product  algorithm  in  that  again  the  o-sequence  which  is  needed 
must  be  obtained  on-the-fly  and  this  entails  the  simultaneous  computation  of  two 
related  c-sequences. 

At  the  beginning  of  this  phase  of  the  algorithm,  for  the  cases  where  = aj{w), 
we  have,  summarizing  the  above  table, 

c = 1 + ao{wY'^  ■ ao(y)“’'^ 


where  = ao(’U')  is  given  by  (41).  We  also  have  available  the  o-sequence  aj{u)  and 
the  initial  value  of  the  c-sequence  for  the  multiplication  given  by  b = 00(1;).  The 
first  step  then  consists  of  (simultaneously)  computing 


61  = 1 -f  7'ai(u)  • In  6 , 


6 


Qi(^) 

1 + rai{u)  ■ In  b 


while  the  subsequent  steps  require  three  simultaneous  (and  very  similar)  computa- 
tions: 


except  Cl  = 1 -(-  riv’^ilnc.  These  steps  can  be  completed  as  far  as  obtaining 
and  cx,_i  where  L = [o]  is  the  level  of  u.  At  this  stage,  if  further  steps 
are  needed  the  algorithm  may  be  completed  in  a similar  fashion  to  Algorithm  1. 
By  definition, 

(i>{z  — L 1) 

(t>{w  - L + 1) 

from  which  it  follows  that 


Hl  = <t>{z  — L)  = — L + 1)  = ln(^(u;  — i -t-  1)  -i-  lncx,_i 

Now,  since  ln^(io  — i -I-  1)  = In  4){u  — Z -f  1)  -f  In  this  yields 

(42)  hi  = /u -I- ln6i_i -I- IncL-i 
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where  = u — L is  the  index  of  u. 

Finally,  ii  hi  > 1,  the  algorithm  is  completed  by  computing  as  many  additional 
terms  of  the  h-sequence  as  necessary.  Since  each  of  the  underlying  operations  can 
increase  the  level  by  at  most  one,  no  more  than  two  such  steps  are  required.  When 
= aj[y),  at  most  one  step  of  the  corresponding  h-sequence  is  needed. 

In  a serial  computing  environment,  each  component  of  the  resulting  vector  is 
computed  using  three  o-sequences  and  two  c-sequences  which  represents  only  a 
relatively  small  (approximately  17%)  saving  relative  to  performing  the  SLI  multi- 
plication and  then  the  addition  each  requiring  two  a-sequences  and  a c-sequence. 
In  a SIMD  parallel  environment  (with  sufficient  processors)  all  the  a-sequences  are 
computable  simultaneously  as  are  all  the  c-sequences  so  that  the  complete  paral- 
lel saxpy  operation  has  similar  parallel  complexity  to  Algorithm  1 for  scalar  SLI 
addition. 


6.  Conclusions 

In  this  paper  we  have  demonstrated  that  fundamental  vector  and  scalar  processes 
in  SLI  arithmetic  have  essentially  the  same  computational  complexity  through  ef- 
fective use  of  parallel  recursive  algorithms.  The  main  source  of  this  parallelism  is  in 
the  simultaneous  computation  of  sequences  which  are  independent  of  one  another, 
essentially  one  for  each  component  of  the  vector  operands.  This  kind  of  parallelism, 
which  is  unavailable  in  floating-point  arithmetic,  makes  SLI  especially  attractive  for 
numerical  linear  algebra. 

We  discussed  also  a software  implementation  that  is  being  developed  on  a 4096- 
processor  SIMD  parallel  computer  system.  This  system  is  ideal  for  demonstrating 
the  parallel  advantages  of  SLI  algorithms  with  a view  toward  a possible  future 
hardware  design.  Currently  the  software  implementation  includes  individual  SLI 
arithmetic  operations  on  scalar  and  vector  operands.  It  is  being  extended  to  include 
all  the  algorithms  discussed  in  this  paper.  Ultimately  it  will  be  used  to  solve  linear 
algebraic  systems  in  SLI  arithmetic  for  demonstration  and  comparison  purposes. 

All  SLI  algorithms  can  be  executed  in  fixed-point  arithmetic.  A suitable  number 
of  guard  digits  is  needed,  as  determined  by  appropriate  error  analysis.  Results 
have  been  obtained  by  a priori  error  analysis  for  individual  arithmetic  operations 
in  earlier  papers,  and  these  were  incorporated  into  our  software  implementation. 
Further  work  in  error  analysis  will  be  the  subject  of  future  papers. 
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